Discovering Lexical Information by Tagging Arabic Newspaper Text
Abstract
In this paper we describe a system for building an Arabic lexicon automatically by tagging Arabic newspaper text. The system uses several techniques to tag the words in the text and to determine their types and features. The major techniques are finding phrases, analyzing the affixes of the words, and analyzing their patterns. Proper nouns are particularly difficult to identify in the Arabic language; we describe techniques for isolating them.
INTRODUCTION
A lexicon is considered to be the backbone of any natural language application. It is an essential basis for parsing, text generation, and information retrieval systems. We cannot implement any of these applications, or others in the natural language area, without a good lexicon. All natural language processing systems need a lexicon full of explicit information [Ahlswede and Evens, 1988; Byrd et al., 1987; McCawley, 1986]. The best way to find the necessary lexical information, we believe, is to extract it automatically from text.
We are developing a part-of-speech tagger for Arabic newspaper text. We are testing it on a corpus developed by Ahmad Hasnah [1996], based on text given to Illinois Institute of Technology by the newspaper Al-Raya, published in Qatar. The questions we address here are how to build efficient techniques for automating the tagger system, and what techniques and algorithms can be used to find the part of speech and extract the features of each word.
Arabic poses problems and challenges that are not present in English or other European languages. Newspaper articles are full of proper nouns, which need special rules to tag: because Arabic does not distinguish between lower-case and upper-case letters, recognizing proper nouns in Arabic text is a big problem. The lack of vowels in the text we are using also creates big problems of ambiguity.
Different vowels change a word from noun to verb and from one type of noun to another; they also change the meaning of the word. For example, the following two words have the same letters in the same sequence but different vowels, and therefore different meanings:
كَتَبَ k(a)t(a)b wrote
كُتُب k(u)t(u)b books
Most published Arabic text is not vowelized, with the exception of the Holy Quran and books for children.
Some words in Arabic text begin with one, two, three, or four extra letters that constitute articles or prepositions. For example, the following word consists of two parts: the particle (a preposition letter), which is attached to the beginning of the noun but is not part of it, and the noun itself.
(on occasion) [Arabic example lost in extraction: the full word, shown as preposition letter + noun]
We need to identify these cases in the text and deal with them in a perceptive way.
In this paper we try to answer these challenges by building a tagger system whose main function is to parse Arabic text, tag the parts of speech, and determine their features in order to build a lexicon for the language. The three main techniques used in this system for tagging words are finding phrases (verb phrases, noun phrases, and proper noun phrases), analyzing the affixes of the word, and analyzing its pattern.
1. TAGGING VERB AND NOUN
There are several signs in the Arabic language that indicate whether a word is a noun or a verb. One of them is the affix of the word: some affixes are used with verbs, some with nouns, and some with both verbs and nouns. Many research projects have used this technique to find the part of speech of a word. Andrei Mikheev [1997] used a technique for fully automatic acquisition of rules that guess possible part-of-speech tags for unknown words from their starting and ending segments. Several types of guessing rules are included: prefix morphological rules and suffix morphological rules.
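The affix cues described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the rule set used in the system: the cue lists (`NOUN_PREFIXES`, `NOUN_SUFFIXES`, `VERB_PREFIXES`, `PARTICLE_LETTERS`), the function names, and the minimum stem length are all hypothetical, chosen from a few well-known Arabic affixes (the definite article, the ta marbuta ending, the future marker before an imperfect-verb prefix). A word that shows no cue is left ambiguous between noun and verb, as happens with the unvowelized k-t-b example.

```python
# Hypothetical cue lists for illustration only; not the paper's actual rules.
# Definite article, which attaches to nouns.
NOUN_PREFIXES = ("ال",)
# Ta marbuta and the feminine plural ending.
NOUN_SUFFIXES = ("ة", "ات")
# Future marker followed by an imperfect-verb prefix.
VERB_PREFIXES = ("سي", "ست", "سن", "سأ")
# Preposition letters that attach to the following noun.
PARTICLE_LETTERS = ("ب", "ل", "ك")
# Assumed guard against splitting short words that merely start with such a letter.
MIN_STEM_LENGTH = 4


def split_particle(word):
    """Split off one attached preposition letter if a long enough stem remains."""
    for letter in PARTICLE_LETTERS:
        if word.startswith(letter) and len(word) - len(letter) >= MIN_STEM_LENGTH:
            return letter, word[len(letter):]
    return "", word


def guess_tags(word):
    """Return the set of tags ('NOUN', 'VERB') suggested by affix cues."""
    particle, stem = split_particle(word)
    tags = set()
    # A preposition letter, the article, or a nominal ending suggests a noun.
    if particle or stem.startswith(NOUN_PREFIXES) or stem.endswith(NOUN_SUFFIXES):
        tags.add("NOUN")
    # The future marker plus an imperfect prefix suggests a verb.
    if stem.startswith(VERB_PREFIXES):
        tags.add("VERB")
    # No cue at all: the word stays ambiguous.
    return tags or {"NOUN", "VERB"}


if __name__ == "__main__":
    for w in ("بمناسبة", "الكتاب", "سيكتب", "كتب"):
        print(w, split_particle(w), sorted(guess_tags(w)))
```

Cues of this kind over-generate: a noun such as سيارة ("car") begins with the same letters as a future verb, so this sketch returns both tags for it. That is why affix evidence would need to be combined with the pattern and phrase evidence discussed in the text.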
Zhang and Kim [1990] developed a system for automated learning of morphological word-function rules. This system divided a string into three regions and inferred from training examples their correspondence to underlying morphological features. More advanced word-guessing methods use word features such as leading and trailing word segments to determine possible tags for unknown words. Such methods can achieve better performance, reaching a tagging accuracy of up to 85% on unknown words for English [Brill, 1992; Weischedel et al., 1993].
Another sign that indicates whether a word is a noun or a verb is the pattern. In Arabic, patterns are an important guide to the type of a word: some patterns are used only for nouns, some only for verbs, and others for both nouns and verbs.
One more sign comes from grammatical rules. Several grammatical rules can be used to distinguish between nouns and verbs: some letters in the Arabic language (letters of signification, which are similar to prepositions in English) mark nouns; others mark verbs.
2. TAGGING PROPER NOUNS
Constructing lexical entries for proper nouns is no less important for supporting natural language applications than defining and analyzing common nouns, verbs, and adjectives. The semantic categories of proper nouns are crucial information for text understanding [Wolinski et al., 1995] and information extraction [Cowie and Lehnert, 1996]. They are also used in information retrieval systems [Paik et al., 1993]. A number of studies have shown the usefulness of lexical-semantic relationships in information retrieval systems [Evens et al., 1985; Nutter et al., 1990; Abu-Salem, 1992]. Lexical-semantic relationships are also important in other applications, such as question-answering systems [Evens and Smith, 1978].
Rau [1991] argues that proper nouns not only account for a large percentage of the unknown words in a text, but are also recognized as a crucial source of information for extracting contents, identifying the topic of a text, or detecting relevant documents in information retrieval. Wacholder [1997] analyzed the types of ambiguity, structural and semantic, that make the discovery of proper names in text difficult. Jong-Sun Kim and Evens [1995] built a natural language processing system for extracting personal names and other proper nouns from the Wall Street Journal.
We have classified the proper nouns that we found in the Al-Raya newspaper as follows:
Personal names:
proper noun    occupation    organization    nationality
M. Evens       Professor     IIT             American
Organization names:
proper noun    type          location    service
IIT            university    Chicago     education
Byte           magazine      America     computer
Location (political names):
proper noun    type     location    language
Chicago        city     Illinois    English
Illinois       State    America     English
Location (natural geographical names):
proper noun    type     location
Nile           river    Africa
Atlantic       ocean    world
Times:
proper noun    part-of     located-at
September      months      9th
Christmas      holidays    December
Products:
product name    kind-of     made-in
Toyota          vehicle     Japan
Compaq          computer    America
Events:
event name    type          place     year    specialist-on
Al-Kitab      exhibition    Egypt     1995    books
Madrid        conference    A.spen    1993    peace
Category (nationality, language, religion, ethnic, party, etc.):
proper noun    type           related-to
American       nationality    America
Arabic         language       Arabs
Unlike English, Arabic does not distinguish between upper-case and lower-case letters, so proper nouns do not begin with a capital letter. This makes them much harder to locate in Arabic text than in English text. For this reason we use another technique for tagging the proper nouns in the text. This technique depends on keywords.
We have studied, analyzed, and classified these keywords so that they can guide us in tagging the proper nouns in the text and in determining their types and features. We have classified the keywords as follows:
• Personal names (title): Mr. John Adams
• Personal names (job title): President John
• Organization names: Northwestern University
• Locations (political names): State of Illinois
• Locations (natural names): Lake Michigan
• Times: Month of September
• Products: IBM Computer
• Events: Exhibition of Egyptian books
• Category: Arabic Language
We have also developed a set of grammatical rules to identify the proper noun phrases in the text. Example:
PNP --> [Arabic token garbled in source] A
A --> A1 | A2
A1 --> ADFPN | ADFPN | ADFPN
A2 --> ADJ | ADJ
ADFPN --> [Arabic token garbled in source] K/W-TITLE A
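The keyword technique can be sketched as a lookup from keyword to proper-noun class. The sketch below is an illustration under stated assumptions, not the system's implementation: it uses the English glosses from the keyword list above in place of the Arabic keywords, and the names `KEYWORD_CLASS` and `classify_phrase` are hypothetical.

```python
# Keyword-to-class table built from the English glosses in the keyword list.
# Assumption: the real system would hold the corresponding Arabic keywords.
KEYWORD_CLASS = {
    "Mr.": "personal name (title)",
    "President": "personal name (job title)",
    "University": "organization name",
    "State": "location (political)",
    "Lake": "location (natural)",
    "Month": "time",
    "Computer": "product",
    "Exhibition": "event",
    "Language": "category",
}


def classify_phrase(phrase):
    """Return (keyword, class) for the first keyword in the phrase, else None."""
    for token in phrase.split():
        proper_noun_class = KEYWORD_CLASS.get(token)
        if proper_noun_class is not None:
            return token, proper_noun_class
    return None


if __name__ == "__main__":
    examples = [
        "Mr. John Adams",
        "Northwestern University",
        "State of Illinois",
        "Lake Michigan",
        "Exhibition of Egyptian books",
        "John Adams",  # no keyword, so no class is assigned
    ]
    for phrase in examples:
        print(phrase, "->", classify_phrase(phrase))
```

A phrase without any keyword returns `None`, which reflects the limitation of a purely keyword-driven approach: a proper noun that appears without a title or type word needs other evidence, such as the grammatical rules for proper noun phrases.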